Deep fragment embeddings for bidirectional image sentence mapping

Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (ICML 2015)

  • Feature map after the front-end CNN: 14×14×512 (width × height × channels)
    • each 1×1×512 slice is an annotation vector $\mathbf{a}_i$
  • Implementation details (from inspecting the released source code)
    • $f_{att}$: 2-layer MLP
      • implementation of $f_{att}(\mathbf{a}_i, \mathbf{h}_{t-1})$
        • intermediate layer: $\mathbf{ctx} = \mathbf{W}_{D \to D}\,\mathbf{a}_i + \mathbf{b}$
        • final layer: $e_{ti} = \mathbf{W}_{D \to 1}(\mathbf{ctx} + \mathbf{W}_{n \to D}\,\mathbf{h}_{t-1}) + b$
    • $f_{init,c}$: 2-layer MLP producing the initial LSTM memory $\mathbf{c}_0$ from the mean annotation vector $\frac{1}{L}\sum_i \mathbf{a}_i$
    • $f_{init,h}$: 2-layer MLP producing the initial LSTM hidden state $\mathbf{h}_0$, likewise from the mean annotation vector
  • Source Code
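The attention and initialization details above can be sketched in NumPy. All weight names, the hidden size $n$, and the random values are illustrative assumptions, not the paper's trained parameters; a tanh is inserted between the two layers of $f_{att}$ (an assumption, since without a nonlinearity a "2-layer MLP" would collapse into a single linear map).

```python
import numpy as np

# Dimensions from the notes: a 14x14x512 feature map gives L_loc = 196
# annotation vectors of size D = 512; n is the LSTM hidden size (assumed).
L_loc, D, n = 14 * 14, 512, 1024

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the learned parameters in the formulas above.
W_DD = rng.standard_normal((D, D)) * 0.01   # W_{D->D}
b_DD = np.zeros(D)
W_nD = rng.standard_normal((D, n)) * 0.01   # W_{n->D}
W_D1 = rng.standard_normal((1, D)) * 0.01   # W_{D->1}
b_D1 = np.zeros(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def f_att(a, h_prev):
    """Score each annotation vector a_i against the previous LSTM state.

    a: (L_loc, D) annotation vectors, h_prev: (n,) previous hidden state.
    Returns attention weights alpha_t over the L_loc spatial locations.
    """
    ctx = a @ W_DD.T + b_DD                            # intermediate layer, (L_loc, D)
    e = np.tanh(ctx + W_nD @ h_prev) @ W_D1.T + b_D1   # final layer, (L_loc, 1)
    return softmax(e.ravel())

def f_init(a, W1, b1, W2, b2):
    """Initialize an LSTM state from the mean annotation vector (2-layer MLP)."""
    m = a.mean(axis=0)
    return np.tanh(W2 @ np.tanh(W1 @ m + b1) + b2)

a = rng.standard_normal((L_loc, D))
h_prev = rng.standard_normal(n)
alpha = f_att(a, h_prev)   # (196,) attention weights, sums to 1
z = alpha @ a              # expected (soft-attention) context vector, (512,)
```

This is the soft-attention path: `alpha` weights the 196 locations and `z` is their expectation, which feeds the LSTM at the next step.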

Deep Visual-Semantic Alignments for Generating Image Descriptions

DenseCap: Fully Convolutional Localization Networks for Dense Captioning